Abstract

The semijoin algebra is the variant of the relational algebra obtained by
replacing the join operator by the semijoin operator. We provide an
Ehrenfeucht-Fraïssé game characterizing the discerning power of the semijoin
algebra. This game gives a method for showing that queries are not expressible
in the semijoin algebra.

1 Introduction

Semijoins are very important in the field of database query processing. While
computing project-join queries in general is NP-complete in the size of the
query and the database, this can be done in polynomial time when the database
schema is acyclic [8], a property known to be equivalent to the existence of a
semijoin program [3]. Semijoins are often used as part of a query pre-processing
phase where dangling tuples are eliminated. Another interesting property is
that the size of a relation resulting from a semijoin is always linear in the size
of the input. Therefore, a query processor will try to use semijoins as often as
possible when generating a query plan for a given query (a technique known
as "pushing projections" [7]). Also in distributed query processing, semijoins
have great importance, because when a database is distributed across several
sites, they can help avoid the shipment of many unneeded tuples.

* Corresponding author.

Email addresses: dirk.leinders@luc.ac.be (Dirk Leinders),
jty@mimuw.edu.pl (Jerzy Tyszkiewicz), jan.vandenbussche@luc.ac.be (Jan
Van den Bussche).

This author has been partially supported by the European Community Research
Training Network "Games and Automata for Synthesis and Validation" (GAMES),
contract HPRN-CT-2002-00283.

Preprint submitted to Elsevier Science, 1 February 2008

Because of their practical importance, we would like to have a clear
understanding of the capabilities and the limitations of semijoins. For example,
Bernstein, Chiu and Goodman [4,5] have characterized the conjunctive queries
computable by semijoin programs. In this paper, we consider the much larger
class of queries computable in the variant of the full relational algebra obtained
by replacing the join operator by the semijoin operator. We call this the semijoin
algebra (SA). We define an Ehrenfeucht-Fraïssé game that characterizes the
discerning power of the semijoin algebra. Using this tool, we delineate the
borderline of expressibility of SA.



2 Preliminaries

In this section, we give a formal definition of the semijoin algebra. 

From the outset, we assume a universe $U$ of basic data values, over which a
number of predicates or relations are defined. These predicates can be combined
into quantifier-free first-order formulas, which are used in selection and
semijoin conditions. The names of these predicates and their arities are
collected in the vocabulary $\Omega$. The equality predicate ($=$) is always in
$\Omega$. A database schema is a finite set $S$ of relation names, each associated
with its arity. $S$ is disjoint from $\Omega$. A database $D$ over $S$ is an
assignment of a finite relation $D(R) \subseteq U^n$ to each $R \in S$, where $n$
is the arity of $R$.

Definition 1 (Semijoin algebra, SA) Let $S$ be a database schema. The syntax
and semantics of the semijoin algebra are inductively defined as follows:

(1) Each relation $R \in S$ belongs to SA.

(2) If $E_1, E_2 \in \mathrm{SA}$ have arity $n$, then also $E_1 \cup E_2$ and
$E_1 - E_2$ belong to SA and are of arity $n$.

(3) If $E_1 \in \mathrm{SA}$ has arity $n$ and $X$ is a subset of $\{1, \ldots, n\}$,
then $\pi_X(E_1)$ belongs to SA and is of arity $\#X$.

(4) If $E_1, E_2 \in \mathrm{SA}$ have arities $n$ and $m$, respectively, and
$\theta_1(x_1, \ldots, x_n)$ and $\theta_2(x_1, \ldots, x_n, y_1, \ldots, y_m)$ are
quantifier-free formulas over $\Omega$, then also $\sigma_{\theta_1}(E_1)$ and
$E_1 \ltimes_{\theta_2} E_2$ belong to SA and are of arity $n$.

The semantics of the selection and the semijoin operator are as follows:

$$\sigma_{\theta_1}(E) := \{(a_1, \ldots, a_n) \in E \mid \theta_1(a_1, \ldots, a_n) \text{ holds}\},$$

$$E_1 \ltimes_{\theta_2} E_2 := \{(a_1, \ldots, a_n) \in E_1 \mid \exists (b_1, \ldots, b_m) \in E_2 : \theta_2(a_1, \ldots, a_n, b_1, \ldots, b_m) \text{ holds}\}.$$

The semantics of the other operators are well known.
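
As an illustration, the operators of Definition 1 can be sketched in Python. The sketch assumes that a relation is a Python set of tuples and that a condition is an ordinary Python predicate; the names `select`, `project` and `semijoin` are ours and are not part of the formal development. Union and difference are the built-in set operators `|` and `-`.

```python
def select(E, theta):
    """sigma_theta(E): keep the tuples of E that satisfy theta."""
    return {a for a in E if theta(a)}


def project(E, X):
    """pi_X(E), for X a set of 1-based positions (kept in increasing order)."""
    positions = sorted(X)
    return {tuple(a[i - 1] for i in positions) for a in E}


def semijoin(E1, E2, theta):
    """E1 semijoin_theta E2: tuples of E1 that have a theta-partner in E2."""
    return {a for a in E1 if any(theta(a, b) for b in E2)}


# Hypothetical example data: a binary relation R and a unary relation S.
R = {(1, 2), (2, 3), (3, 4)}
S = {(2,), (4,)}

# Tuples of R whose second component occurs in S (condition x_2 = y_1).
print(semijoin(R, S, lambda a, b: a[1] == b[0]))
```

Note that the result of `semijoin` is always a subset of its first argument, which is the linear-size property mentioned in the introduction.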

3 An Ehrenfeucht-Fraïssé game for the semijoin algebra

In this section, we describe an Ehrenfeucht-Fraïssé game that characterizes
the discerning power of the semijoin algebra.

Let $A$ and $B$ be two databases over the same schema $S$. The semijoin game on
these databases is played by two players, called the spoiler and the duplicator.
They, in turn, choose tuples from the tuple spaces $T_A$ and $T_B$, which are
defined as follows: $T_A := \bigcup_{R \in S} \bigcup \{\pi_X(A(R)) \mid X \subseteq \{1, \ldots, \mathrm{arity}(R)\}\}$,
and $T_B$ is defined analogously. So, the players can pick tuples from the
databases and projections of these.
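
The tuple space can be computed directly from its definition. The following is a minimal sketch, assuming a database is a Python dict from relation names to sets of tuples; the function name `tuple_space` and the example database are our own.

```python
from itertools import combinations


def tuple_space(db):
    """All tuples of the relations in db, together with all their projections."""
    space = set()
    for rel in db.values():
        for t in rel:
            # X ranges over all subsets of the positions of t (0-based here).
            for size in range(len(t) + 1):
                for X in combinations(range(len(t)), size):
                    space.add(tuple(t[i] for i in X))
    return space


# Hypothetical database with a single binary relation R.
A = {"R": {(1, 2), (2, 3)}}
T_A = tuple_space(A)
```

Since $X$ may be empty, the empty tuple belongs to the tuple space as soon as some relation is nonempty.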

At each stage in the game, there is a tuple $\bar a \in T_A$ and a tuple
$\bar b \in T_B$. We will denote such a configuration by $(A, \bar a; B, \bar b)$.
The conditions for the duplicator to win the game with 0 rounds are:

(1) $\forall R \in S, \forall X \subseteq \{1, \ldots, \mathrm{arity}(R)\} : \bar a \in \pi_X(A(R)) \Leftrightarrow \bar b \in \pi_X(B(R))$

(2) for every atomic formula (equivalently, for every quantifier-free formula)
$\theta$ over $\Omega$, $\theta(\bar a)$ holds iff $\theta(\bar b)$ holds.

In the game with $m \geq 1$ rounds, the spoiler will be the first one to make a
move. Therefore, he first chooses a database ($A$ or $B$). Then he picks a tuple
in $T_A$ or in $T_B$, respectively. The duplicator then has to make an "analogous"
move in the other tuple space. When the duplicator can keep this up for $m$ rounds,
no matter what moves the spoiler makes, we say that the duplicator wins the
$m$-round semijoin game on $A$ and $B$. The "analogous" moves for the duplicator
are formally defined as legal answers in the next definition.

Definition 2 (legal answer) Suppose that at a certain moment in the semijoin
game, the configuration is $(A, \bar a; B, \bar b)$. If the spoiler takes a tuple
$\bar c \in T_A$ in his next move, then the tuples $\bar d \in T_B$ for which the
following conditions hold are legal answers for the duplicator:

(1) $\forall R \in S, \forall X \subseteq \{1, \ldots, \mathrm{arity}(R)\} : \bar d \in \pi_X(B(R)) \Leftrightarrow \bar c \in \pi_X(A(R))$

(2) for every atomic formula $\theta$ over $\Omega$, $\theta(\bar a, \bar c)$ holds iff $\theta(\bar b, \bar d)$ holds.

If the spoiler takes a tuple $\bar d \in T_B$, the legal answers $\bar c \in T_A$
are defined identically.

In the following, we denote the $m$-round semijoin game with initial
configuration $(A, \bar a; B, \bar b)$ by $G_m(A, \bar a; B, \bar b)$.
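
For small databases, the $m$-round game can be decided by exhaustive search. The sketch below is only an illustration under stated assumptions: databases are dicts from relation names to sets of tuples, $\Omega$ contains only equality (or equality and a total order when `ordered=True`), and the initial configuration is required to satisfy the 0-round conditions. All function names are ours.

```python
from functools import lru_cache
from itertools import combinations


def tuple_space(db):
    """Every tuple of every relation of db, plus all of its projections."""
    return {tuple(t[i] for i in X)
            for rel in db.values() for t in rel
            for size in range(len(t) + 1)
            for X in combinations(range(len(t)), size)}


def signature(db, t):
    """All pairs (R, X) with t in pi_X(db[R]); compared in condition (1)."""
    return frozenset((name, X)
                     for name, rel in db.items() for u in rel
                     for X in combinations(range(len(u)), len(t))
                     if tuple(u[i] for i in X) == t)


def atomic_type(t, ordered):
    """Equality (or order) pattern among the components of t; condition (2)."""
    if ordered:
        return tuple((x > y) - (x < y) for x in t for y in t)
    return tuple(x == y for x in t for y in t)


def duplicator_wins(A, B, a, b, m, ordered=False):
    """True iff the duplicator wins the m-round game from (A, a; B, b)."""
    T_A, T_B = tuple_space(A), tuple_space(B)
    sig_A = {t: signature(A, t) for t in T_A | {a}}
    sig_B = {t: signature(B, t) for t in T_B | {b}}

    def legal(a, c, b, d):
        # Conditions (1) and (2) of a legal answer.
        return (sig_A[c] == sig_B[d] and
                atomic_type(a + c, ordered) == atomic_type(b + d, ordered))

    @lru_cache(maxsize=None)
    def wins(a, b, k):
        if k == 0:
            return True
        # The spoiler may move in A (pick c) or in B (pick d).
        return (all(any(legal(a, c, b, d) and wins(c, d, k - 1) for d in T_B)
                    for c in T_A) and
                all(any(legal(a, c, b, d) and wins(c, d, k - 1) for c in T_A)
                    for d in T_B))

    return legal((), a, (), b) and wins(a, b, m)


# Hypothetical example: a unary relation S with 2 versus 3 elements.
A = {"S": {(1,), (2,)}}
B = {"S": {(1,), (2,), (3,)}}
print(duplicator_wins(A, B, (), (), 4))                # equality only
print(duplicator_wins(A, B, (), (), 2, ordered=True))  # equality and order
```

The search is exponential in the number of rounds only through the memoised recursion on configurations, so it is practical for toy instances but not meant as an efficient procedure.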

We first state and prove the following proposition.

Proposition 3 If the duplicator wins $G_m(A, \bar a; B, \bar b)$, then for each
semijoin expression $E$ with $\leq m$ nested semijoins and projections, we have
$\bar a \in E(A) \Leftrightarrow \bar b \in E(B)$.



PROOF. We prove this by induction on $m$. The base case $m = 0$ is clear.
Now consider the case $m > 0$. Suppose that $\bar a \in (E_1 \ltimes_\theta E_2)(A)$
but $\bar b \notin (E_1 \ltimes_\theta E_2)(B)$. Then $\bar a \in E_1(A)$ and
$\exists \bar c \in E_2(A) : \theta(\bar a, \bar c)$, and either (*) $\bar b \notin E_1(B)$
or (**) $\neg \exists \bar d \in E_2(B) : \theta(\bar b, \bar d)$. In situation (*),
$\bar a$ and $\bar b$ are distinguished by an expression with $m - 1$ semijoins or
projections, so the spoiler has a winning strategy; in situation (**), the
spoiler has a winning strategy by choosing this $\bar c \in E_2(A)$ with
$\theta(\bar a, \bar c)$, because each legal answer $\bar d$ of the duplicator has
$\theta(\bar b, \bar d)$ and therefore $\bar d \notin E_2(B)$. So, by the induction
hypothesis applied to $E_2$, the spoiler now has a winning strategy in the game
$G_{m-1}(A, \bar c; B, \bar d)$. In case a projection distinguishes $\bar a$ and
$\bar b$, a similar winning strategy for the spoiler exists. In case $\bar a$ and
$\bar b$ are distinguished by an expression that is neither a semijoin nor a
projection, there is a simpler expression that distinguishes them, so the result
follows by structural induction.



We now come to the main theorem of the text. This theorem concerns the
game $G_\infty(A, \bar a; B, \bar b)$, which we also abbreviate as
$G(A, \bar a; B, \bar b)$. We say that the duplicator wins $G(A, \bar a; B, \bar b)$
if the spoiler has no winning strategy. This means that the duplicator can keep
on playing forever, choosing legal answers for every move of the spoiler.

Theorem 4 The duplicator wins $G(A, \bar a; B, \bar b)$ if and only if for each
semijoin expression $E$, we have $\bar a \in E(A) \Leftrightarrow \bar b \in E(B)$.



PROOF. The 'only if' direction of the proof follows directly from
Proposition 3, because if the duplicator wins $G(A, \bar a; B, \bar b)$, he wins
$G_m(A, \bar a; B, \bar b)$ for every $m \geq 0$. So, $\bar a$ and $\bar b$ are
indistinguishable through all semijoin expressions. For the 'if' direction, it is
sufficient to prove that if the duplicator loses, $\bar a$ and $\bar b$ are
distinguishable. We therefore construct, by induction on $r$, a semijoin
expression $E^r_{\bar a}$ such that (i) $\bar a \in E^r_{\bar a}(A)$, and (ii)
$\bar b \in E^r_{\bar a}(B)$ iff the duplicator wins $G_r(A, \bar a; B, \bar b)$.
We define $E^0_{\bar a}$ as

$$\sigma_{\theta_{\bar a}}\Bigl(\bigcap_{R \in S}\ \bigcap_{\{X \subseteq Z \mid \bar a \in \pi_X(A(R))\}} \pi_X(R)\Bigr) \;-\; \bigcup_{R \in S}\ \bigcup_{\{X \subseteq Z \mid \bar a \notin \pi_X(A(R))\}} \pi_X(R).$$

In this expression, $Z$ is a shorthand for $\{1, \ldots, \mathrm{arity}(R)\}$ and
$\theta_{\bar a}$ is the atomic type of $\bar a$ over $\Omega$, i.e., the conjunction
of all atomic and negated atomic formulas over $\Omega$ that are true of $\bar a$.
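
Table 1
Queries delineating the expressive power of the semijoin algebra.

Expressible                                         Inexpressible
--------------------------------------------------  --------------------------------------------------
$R \times S \cap T$                                 $R \times S \subseteq T$
$T \subseteq R \times S$                            $T = R \times S$
                                                    $R \circ S \cap T$
                                                    $T \subseteq R \circ S$
                                                    $R \circ S \subseteq T$
$\exists$ path of length $k$
$\exists$ simple path of length $k$ ($k \leq 2$)    $\exists$ simple path of length $k$ ($k \geq 3$)
$\exists$ cycle of length $k$ ($k \leq 2$)          $\exists$ cycle of length $k$ ($k \geq 3$)
                                                    $\exists \geq k$ elements ($k \geq 3$)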



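We now construct $E^r_{\bar a}$ in terms of $E^{r-1}_{\bar c}$:

$$E^r_{\bar a} := \bigcap_{\bar c \in T_A} \bigl(E^0_{\bar a} \ltimes_{\theta_{\bar a, \bar c}} E^{r-1}_{\bar c}\bigr) \;-\; \bigcup_{j=1}^{s}\ \bigcup_{\theta} \Bigl(E^0_{\bar a} \ltimes_{\theta} \Bigl(\bigcup_{\substack{\bar c \in T_A \\ \theta(\bar a, \bar c)}} E^{r-1}_{\bar c}\Bigr)^{\mathrm{compl}}\Bigr)$$

In this expression, $\theta_{\bar a, \bar c}$ is the atomic type of $\bar a$ and
$\bar c$ over $\Omega$; $s$ is the maximal arity of a relation in $S$; $\theta$
ranges over all atomic $\Omega$-types of two tuples, one with the arity of
$\bar a$, and one with arity $j$. The notation $E^{\mathrm{compl}}$, for an
expression $E$ of arity $k$, is a shorthand for

$$\Bigl(\bigcup_{R \in S}\ \bigcup_{\substack{X \subseteq \{1, \ldots, \mathrm{arity}(R)\} \\ \#X = k}} \pi_X(R)\Bigr) - E.$$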



4 The expressive power of the semijoin algebra 



In this section, we present some queries that delineate the expressive power of
the semijoin algebra. They are summarized in Table 1. The operation $R \circ S$
for binary relations $R$ and $S$ is a shorthand for $\pi_{1,4}(\sigma_{2=3}(R \times S))$.

We now discuss the results presented in the table. The semijoin algebra lacks
the cartesian product operator, but nevertheless one can check whether
$T \subseteq R \times S$. Indeed, $T \subseteq R \times S$ iff
$T - (T \cap R \times S) = \emptyset$, and, for binary $R$ and $S$,
$T \cap R \times S = (T \ltimes_{x_1 = y_1 \wedge x_2 = y_2} R) \ltimes_{x_3 = y_1 \wedge x_4 = y_2} S$.
Conversely, it is impossible to check whether $T \supseteq R \times S$. In Figure 1,
two databases $A$ and $B$ are shown that are indistinguishable through semijoin
expressions because the duplicator has an obvious winning strategy. But $A$
satisfies $T \supseteq R \times S$ and $B$ does not. The same databases actually
show that it is impossible to check whether $T = R \times S$.

Fig. 1. In $A$, $T = R \times S$, but not in $B$.

Fig. 2. In $A$, $T = R \circ S$, but in $B$ neither $T \subseteq R \circ S$ nor $T \supseteq R \circ S$.
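
The containment test $T \subseteq R \times S$ uses only semijoins and difference, which the following sketch makes concrete. It assumes binary relations `R` and `S` and a 4-ary relation `T`, all represented as Python sets of tuples; the function names and the example data are ours.

```python
def semijoin(E1, E2, theta):
    """E1 semijoin_theta E2: tuples of E1 that have a theta-partner in E2."""
    return {a for a in E1 if any(theta(a, b) for b in E2)}


def contained_in_product(T, R, S):
    """Check T <= R x S without ever building the cartesian product."""
    # T semijoin_{x1 = y1 and x2 = y2} R
    in_R = semijoin(T, R, lambda x, y: x[0] == y[0] and x[1] == y[1])
    # ... semijoin_{x3 = y1 and x4 = y2} S, which yields T intersected with R x S
    in_both = semijoin(in_R, S, lambda x, y: x[2] == y[0] and x[3] == y[1])
    return len(T - in_both) == 0


# Hypothetical example data.
R = {(1, 2)}
S = {(3, 4), (5, 6)}
T_good = {(1, 2, 3, 4)}
T_bad = {(1, 2, 3, 4), (1, 2, 7, 8)}
print(contained_in_product(T_good, R, S), contained_in_product(T_bad, R, S))
```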

Although one can check in SA whether a relation is contained in a cartesian
product, it is impossible to check whether a relation is contained in, or
contains, a join. Using our semijoin game, one can show that databases $A$ and
$B$ in Figure 2 satisfy the same semijoin expressions. But $A$ satisfies
$T = R \circ S$, while $B$ satisfies neither $T \subseteq R \circ S$ nor
$T \supseteq R \circ S$. Note that a binary relation $R$ is transitive if and only
if $R \circ R \subseteq R$. This is a special case of $R \circ S \subseteq T$; yet,
a similar argument shows that transitivity is also inexpressible in the semijoin
algebra.

The existence of a path of length $k$ can be checked with the following
inductively defined semijoin expression:

$$\begin{cases} \mathrm{path}(1) := R \\ \mathrm{path}(k) := R \ltimes_{x_2 = y_1} \mathrm{path}(k-1) \end{cases}$$
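
The inductive definition of path($k$) unfolds into a simple loop. The sketch below assumes the edge relation is a Python set of pairs; the helper names are ours. The result is the set of edges that start a path of length $k$, so a path of length $k$ exists iff the result is nonempty.

```python
def semijoin(E1, E2, theta):
    """E1 semijoin_theta E2: tuples of E1 that have a theta-partner in E2."""
    return {a for a in E1 if any(theta(a, b) for b in E2)}


def path(R, k):
    """path(1) := R; path(k) := R semijoin_{x2 = y1} path(k - 1)."""
    result = set(R)
    for _ in range(k - 1):
        result = semijoin(R, result, lambda x, y: x[1] == y[0])
    return result


# Hypothetical example: a chain 1 -> 2 -> 3 -> 4 and a 2-cycle.
chain = {(1, 2), (2, 3), (3, 4)}
two_cycle = {(1, 2), (2, 1)}
print(path(chain, 3), path(chain, 4), path(two_cycle, 5))
```

On the 2-cycle the expression is nonempty for every $k$, because the paths it detects need not be simple.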



Problems arise when we require the path to be simple. Let $D^{(k)}$ be the
structure $\{(1, 2), (2, 3), \ldots, (k-1, k), (k, 1)\}$ over the schema $S$
containing a single edge relation $R$. Then, the duplicator has a winning
strategy in the infinite game played on $D^{(k)}$ and $D^{(k+1)}$, where
$k \geq 4$. To see this, note that only three types of moves are possible here:
next tuple (change only first component of pebbled tuple), previous tuple
(change only second component) and other tuple (change both components). The
duplicator can answer every type of move of the spoiler. But $D^{(k+1)}$ contains
a simple path of length $k$ and $D^{(k)}$ does not. For $k = 3$, note that
$D^{(3)}$ and $D^{(4)}$ are distinguishable. Nevertheless, existence of a simple
path of length 3 is still inexpressible because $D^{(4)}$ is indistinguishable
from the structure consisting of two disjoint copies of $D^{(3)}$. For $k = 2$,
the existence of a simple path of length 2 is expressible as
$R \ltimes_{x_2 = y_1 \wedge x_1 \neq x_2 \wedge y_1 \neq y_2 \wedge x_1 \neq y_2} R$.
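
The claims about which cycle structures contain a simple path can be checked by brute force on small instances. This is only a sanity check of the graph-theoretic facts, not of indistinguishability; the function names `cycle_graph` and `has_simple_path` are ours.

```python
from itertools import permutations


def cycle_graph(k):
    """D^(k): the directed cycle 1 -> 2 -> ... -> k -> 1."""
    return {(i, i % k + 1) for i in range(1, k + 1)}


def has_simple_path(R, k):
    """Brute force: is there a path with k edges through k + 1 distinct nodes?"""
    nodes = {v for edge in R for v in edge}
    return any(all((p[i], p[i + 1]) in R for i in range(k))
               for p in permutations(nodes, k + 1))


# Two disjoint copies of D^(3): the second copy uses nodes 4, 5, 6.
two_triangles = cycle_graph(3) | {(a + 3, b + 3) for a, b in cycle_graph(3)}

print(has_simple_path(cycle_graph(5), 4), has_simple_path(cycle_graph(4), 4))
print(has_simple_path(cycle_graph(4), 3), has_simple_path(two_triangles, 3))
```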

Another property that is inexpressible in SA is the existence of a cycle of
length $k$. For $k \geq 4$, the inexpressibility result follows because $D^{(k)}$
contains a cycle of length $k$ and $D^{(k+1)}$ does not. For $k = 3$, note that
the structure consisting of two disjoint copies of $D^{(3)}$ contains a cycle of
length 3, but $D^{(4)}$ does not.

A last example of a query that is inexpressible in SA is the query that asks if
there are at least $k$ elements in a unary relation $S$, where $k \geq 3$. This
property is inexpressible because the duplicator has a winning strategy in the
infinite game played on two relations, one with 2 and one with $k$ distinct
elements.



5 Impact of order 



In this section, we investigate the impact of order. On ordered databases
(where $\Omega$ now also contains a total order on the domain), the query that
asks if there are at least $k$ elements in a unary relation $S$ becomes expressible
as $\mathrm{at\_least}(k)$, which is inductively defined as follows:

$$\begin{cases} \mathrm{at\_least}(1) := S \\ \mathrm{at\_least}(k) := S \ltimes_{x_1 < y_1} \mathrm{at\_least}(k-1) \end{cases}$$

Note that this query is independent of the order. This is very interesting 
because in first-order logic, there also exists an order-invariant query that is 
expressible with but inexpressible without order ([1, Exercise 17.27] and [6, 
Proposition 2.5.6]). 
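
The inductive definition of at_least($k$) can be unfolded in the same way as path($k$). The sketch assumes a unary relation stored as a Python set of 1-tuples whose values are comparable with `<`; the helper names are ours. The relation has at least $k$ elements iff the result is nonempty.

```python
def semijoin(E1, E2, theta):
    """E1 semijoin_theta E2: tuples of E1 that have a theta-partner in E2."""
    return {a for a in E1 if any(theta(a, b) for b in E2)}


def at_least(S, k):
    """at_least(1) := S; at_least(k) := S semijoin_{x1 < y1} at_least(k - 1)."""
    result = set(S)
    for _ in range(k - 1):
        result = semijoin(S, result, lambda x, y: x[0] < y[0])
    return result


# Hypothetical example: a unary relation with three elements.
S = {(10,), (20,), (30,)}
print(at_least(S, 3), at_least(S, 4))
```

Each unfolding step keeps the elements that have a larger element in the previous result, so at_least($k$) holds exactly the elements with at least $k - 1$ larger elements in `S`.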

Some inexpressible queries presented in Section 4 remain inexpressible on
ordered databases. An example is the query $R \times S \subseteq T$. Indeed,
consider the following databases $A$ and $B$: $A(R) = B(R) = \{1, 2, \ldots, m\}$,
$A(S) = B(S) = \{m+1, m+2, \ldots, 2m\}$, $A(T) = A(R) \times A(S)$ and
$B(T) = A(T) - \{(\frac{m+1}{2}, m + \frac{m+1}{2})\}$. We will show that when
$m = 2n + 1$, the duplicator has a winning strategy in the $n$-round semijoin game
$G_n(A, (); B, ())$ with $\Omega = \{=, <\}$. From Proposition 3, it then follows
that the query $R \times S \subseteq T$ is inexpressible in SA.

The duplicator's winning strategy consists of playing exact match until the
spoiler chooses $\bar c$ to be the special tuple $(\frac{m+1}{2}, m + \frac{m+1}{2})$
in $A$. In that case we must distinguish five possibilities for the previous
tuple $\bar a$: two determined by the value of $a_1$ alone (cases 1 and 2), two
determined by the values of both $a_1$ and $a_2$ (cases 3 and 4), and all other
cases (case 5). Each case prescribes a specific answer $\bar d \in B(T)$ for the
duplicator. Let us assume case 1 applies; cases 2 to 5 are analogous. Then, there
are two possibilities. First, if the spoiler chooses a value $c_1 \neq a_1 - 1$ or
if he chooses a value $d_1 \neq b_1 + 1$ in some next round, the duplicator can
play exact match and the game starts over. Second, if the spoiler chooses in each
next round $c_1 = a_1 - 1$ or $d_1 = b_1 + 1$, the duplicator answers
$d_1 = b_1 - 1$ or $c_1 = a_1 + 1$, respectively. The duplicator can follow this
strategy for at least $n - 1$ rounds (recall that $m = 2n + 1$). Counting from the
round where the spoiler chose the special tuple, we thus see that the duplicator
wins the game $G_n(A, (); B, ())$.

Exactly the same argument shows that also the query $R \times S = T$ is
inexpressible in SA with order.

Another query from Table 1 that remains inexpressible in SA with order is
$R \circ S \subseteq T$. To see this, consider the following databases $A$ and
$B$: $A(R) = B(R) = \{1, \ldots, m\} \times \{2m+1\}$,
$A(S) = B(S) = \{2m+1\} \times \{m+1, \ldots, 2m\}$,
$A(T) = A(R) \circ A(S) = \{1, \ldots, m\} \times \{m+1, \ldots, 2m\}$ and
$B(T) = B(R) \circ B(S) - \{(\frac{m+1}{2}, m + \frac{m+1}{2})\}$. A similar
argument as in the previous paragraph shows that when $m = 2n + 1$, the duplicator
wins $G_n(A, (); B, ())$. Again, this also shows that $R \circ S = T$ is
inexpressible in SA with order.

For the remaining SA-inexpressible queries in Table 1, the question whether
they become expressible in SA with order remains open.



6 Concluding remarks

Interestingly, there is a fragment of first-order logic very similar to the semijoin
algebra: it is the so-called "guarded fragment" (GF) [2], which has been studied
in the field of modal logic. This is interesting because the motivations to study
this fragment came purely from the field of logic and had nothing to do with
database query processing. Indeed, the purpose was to extend propositional
modal logic to the predicate level, retaining the good properties of modal
logic, such as the finite model property. An important tool in the study of
the expressive power of the GF is the notion of "guarded bisimulation", which
provides a characterization of the discerning power of the GF.

When we only allow conjunctions of equalities to be used in the semijoin condi- 
tions, SA is subsumed by GF, and conversely, every GF sentence is expressible 
in SA. 

When negations of equalities are allowed in semijoin conditions, however, SA is
no longer subsumed by GF. A counterexample is the query that asks whether
there are at least two distinct elements in a single unary relation $S$. This is
expressible in SA as $S \ltimes_{x_1 \neq y_1} S$, but it is not expressible in GF.
Proofs of the claims presented in this section will be presented in a separate
paper.